Heart Disease Risk Factor Analysis Dashboard

Exploring relationships between health indicators and heart disease

Author

Joe Valente

Published

March 2, 2025

Overview

This dashboard analyzes a heart disease dataset from Kaggle, examining the relationships between various health indicators, lifestyle factors, and heart disease status. The analysis aims to identify key risk factors and patterns that may contribute to heart disease.

Data Source: Heart Disease Dataset on Kaggle

Program: Master of Science in Data Science, University of Colorado Boulder

Note

This analysis uses a 50% random sample of the original dataset for visualization with Altair. For reproducibility, the sample is drawn in Pandas with a fixed random state of 0: df.sample(frac=0.50, random_state=0)
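
As a minimal sketch of why the fixed random state matters, the snippet below draws the same 50% sample twice from a small stand-in DataFrame. The data and column names here are illustrative assumptions, not the Kaggle dataset itself.

```python
import pandas as pd

# Hypothetical stand-in for the full dataset (values and columns are assumed).
df = pd.DataFrame({"Age": range(18, 38), "BMI": [20.0 + i for i in range(20)]})

# Draw half of the rows; a fixed random_state makes the draw repeatable.
sample_a = df.sample(frac=0.50, random_state=0)
sample_b = df.sample(frac=0.50, random_state=0)

print(len(sample_a))              # number of sampled rows
print(sample_a.equals(sample_b))  # True when both draws pick the same rows
```

Changing or omitting random_state would give a different subset on each run, so the figures reported below would no longer be reproducible.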

Summary Statistics

The table below provides summary statistics for the numerical variables in the dataset. These values give us a baseline understanding of the central tendencies and distributions of key health metrics.

| Statistic | Age | Blood Pressure | Cholesterol Level | BMI | Sleep Hours | Triglyceride Level | Fasting Blood Sugar | CRP Level | Homocysteine Level |
|---|---|---|---|---|---|---|---|---|---|
| count | 4986.00 | 4989.00 | 4982.00 | 4990.00 | 4988.00 | 4994.00 | 4989.00 | 4985.00 | 4989.00 |
| mean | 49.57 | 149.79 | 225.35 | 29.01 | 6.99 | 249.28 | 119.92 | 7.48 | 12.44 |
| std | 18.14 | 17.58 | 43.58 | 6.34 | 1.76 | 87.04 | 23.63 | 4.32 | 4.33 |
| min | 18.00 | 120.00 | 150.00 | 18.00 | 4.00 | 100.00 | 80.00 | 0.01 | 5.01 |
| 25% | 34.00 | 134.00 | 187.00 | 23.46 | 5.45 | 175.00 | 99.00 | 3.70 | 8.67 |
| 50% | 50.00 | 150.00 | 226.00 | 29.04 | 6.99 | 248.00 | 120.00 | 7.49 | 12.35 |
| 75% | 65.00 | 165.00 | 263.00 | 34.48 | 8.55 | 324.00 | 141.00 | 11.18 | 16.13 |
| max | 80.00 | 180.00 | 300.00 | 40.00 | 10.00 | 400.00 | 160.00 | 15.00 | 20.00 |
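
A table with this layout (count, mean, std, min, quartiles, max) is what pandas' describe() returns for numerical columns. The sketch below assumes that method was used and runs it on a tiny made-up DataFrame; the values are illustrative only.

```python
import pandas as pd

# Illustrative data only; one Age value is missing on purpose.
df = pd.DataFrame({
    "Age": [34, 50, 65, 80, None],
    "Sleep Hours": [5.5, 7.0, 8.5, 10.0, 4.0],
})

# describe() summarizes each numerical column and skips missing values.
summary = df.describe().round(2)
print(summary)
```

Because describe() counts only non-missing entries, columns with missing values report a smaller count, which would be one explanation for the slightly different counts per column in the table above.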

Demographic & Lifestyle Factors

This section explores the categorical variables in the dataset, including gender, diabetes status, smoking habits, and other lifestyle factors that may influence heart disease risk.

Gender and Diabetes Distribution

The charts below show the distribution of participants by gender and diabetes status. Understanding these demographic factors is important as they can significantly affect heart disease risk.

Heart Disease Status and Smoking Habits

These charts display the distribution of heart disease status in the sample and smoking habits. Smoking is a well-known risk factor for cardiovascular disease, and this visualization helps us understand its prevalence in our dataset.

Blood Pressure and Cholesterol Status

High blood pressure and high cholesterol are major risk factors for heart disease. These visualizations show the distribution of participants with and without these conditions.

Stress and Sugar Consumption

Stress and dietary factors like sugar consumption can contribute to heart disease risk. These charts show the distribution of stress levels and sugar consumption habits in our sample.

Alcohol Consumption

Alcohol consumption can have complex effects on heart health. This chart displays the distribution of alcohol consumption patterns among participants.

Important

The categorical variables are all roughly uniformly distributed, except for Heart Disease Status. This seems odd if the dataset is meant to be a random sample that is representative of a real population.
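
One way to check this numerically, rather than by eye, is to look at the share of each category per column. The sketch below uses made-up data and assumed column names and labels; a near-uniform variable shows roughly equal shares, while an imbalanced one does not.

```python
import pandas as pd

# Illustrative data; column names and category labels are assumptions.
df = pd.DataFrame({
    "Gender": ["Male", "Female"] * 5,
    "Heart Disease Status": ["No"] * 8 + ["Yes"] * 2,
})

# Relative frequency of each category, per column.
shares = {
    column: df[column].value_counts(normalize=True)
    for column in ["Gender", "Heart Disease Status"]
}

for column, share in shares.items():
    print(column)
    print(share.round(2))
```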

Comparing Heart Disease Groups

This section compares participants with and without heart disease to identify potential differences in health metrics and risk factors.

Summary Statistics by Heart Disease Status

The tables below show summary statistics for participants with and without heart disease. Comparing these values can help identify which metrics differ between the two groups.

With Heart Disease

| Statistic | Age | Blood Pressure | Cholesterol Level | BMI | Sleep Hours | Triglyceride Level | Fasting Blood Sugar | CRP Level | Homocysteine Level |
|---|---|---|---|---|---|---|---|---|---|
| count | 1022.00 | 1021.00 | 1019.00 | 1022.00 | 1022.00 | 1022.00 | 1019.00 | 1019.00 | 1021.00 |
| mean | 49.32 | 149.71 | 226.46 | 29.07 | 6.98 | 250.20 | 119.80 | 7.36 | 12.43 |
| std | 18.35 | 17.51 | 43.01 | 6.18 | 1.79 | 87.19 | 23.57 | 4.26 | 4.37 |
| min | 18.00 | 120.00 | 150.00 | 18.01 | 4.00 | 100.00 | 80.00 | 0.01 | 5.02 |
| 25% | 34.00 | 134.00 | 189.00 | 23.73 | 5.40 | 177.00 | 99.00 | 3.80 | 8.61 |
| 50% | 49.00 | 150.00 | 227.00 | 29.27 | 6.98 | 245.50 | 119.00 | 7.24 | 12.45 |
| 75% | 66.00 | 165.00 | 265.00 | 34.31 | 8.56 | 327.75 | 141.00 | 11.02 | 16.24 |
| max | 80.00 | 180.00 | 300.00 | 39.94 | 10.00 | 400.00 | 160.00 | 15.00 | 20.00 |

Without Heart Disease

| Statistic | Age | Blood Pressure | Cholesterol Level | BMI | Sleep Hours | Triglyceride Level | Fasting Blood Sugar | CRP Level | Homocysteine Level |
|---|---|---|---|---|---|---|---|---|---|
| count | 3964.00 | 3968.00 | 3963.00 | 3968.00 | 3966.00 | 3972.00 | 3970.00 | 3966.00 | 3968.00 |
| mean | 49.63 | 149.81 | 225.07 | 28.99 | 7.00 | 249.05 | 119.96 | 7.51 | 12.44 |
| std | 18.09 | 17.60 | 43.73 | 6.38 | 1.75 | 87.01 | 23.65 | 4.34 | 4.32 |
| min | 18.00 | 120.00 | 150.00 | 18.00 | 4.00 | 100.00 | 80.00 | 0.01 | 5.01 |
| 25% | 34.00 | 134.00 | 187.00 | 23.43 | 5.46 | 174.00 | 99.00 | 3.68 | 8.69 |
| 50% | 50.00 | 149.00 | 226.00 | 28.98 | 7.00 | 248.00 | 120.00 | 7.55 | 12.33 |
| 75% | 65.00 | 165.00 | 262.50 | 34.55 | 8.55 | 323.00 | 141.00 | 11.21 | 16.10 |
| max | 80.00 | 180.00 | 300.00 | 40.00 | 10.00 | 400.00 | 160.00 | 15.00 | 20.00 |

Note

Looking at these summary statistics, there don’t appear to be dramatic differences between groups at first glance. Further statistical analysis is needed to identify significant differences.
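
Per-group summaries like the two tables above can be produced in one step by grouping before describing. This is a sketch on made-up data with an assumed "Heart Disease Status" column holding "Yes"/"No" labels, not necessarily how the tables were generated.

```python
import pandas as pd

# Illustrative data; the column name and labels are assumptions.
df = pd.DataFrame({
    "Heart Disease Status": ["Yes", "No", "No", "Yes", "No", "No"],
    "BMI": [29.0, 28.0, 30.0, 31.0, 27.0, 31.0],
})

# One row of summary statistics per group.
by_status = df.groupby("Heart Disease Status")["BMI"].describe()
print(by_status[["count", "mean", "std"]].round(2))
```

Placing both groups in one table makes it easier to compare means and quartiles side by side.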

Correlation Analysis for Heart Disease Group

The heatmap below displays correlations between numerical variables for participants with heart disease. A correlation value ranges from -1 to 1:

  • A value close to 1 means that as one variable increases, the other tends to increase as well (positive correlation)
  • A value close to -1 means that as one variable increases, the other tends to decrease (negative correlation)
  • A value close to 0 means there is little to no relationship between the variables
  • For correlations, neither variable is considered dependent or independent since we’re measuring their mutual relationship rather than causation

For example, if age and blood pressure had a correlation of 0.8, this would suggest that blood pressure tends to increase with age. The darker colors in the heatmap indicate stronger correlations (either positive or negative).

Note

The correlation heatmap shows no strong correlations between the variables. That is to say, no pair of numerical variables has a strong linear relationship.
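
To make the interpretation of correlation values concrete, here is a small sketch using pandas' corr(), which computes Pearson correlation by default. The variables are synthetic and chosen to hit the three cases described above; the last column also shows that a correlation near 0 only rules out a linear relationship.

```python
import pandas as pd

x = pd.Series([-2.0, -1.0, 0.0, 1.0, 2.0])

# Synthetic columns built from x to illustrate each case.
df = pd.DataFrame({
    "x": x,
    "rises_with_x": 2 * x + 1,   # perfect positive linear relationship
    "falls_with_x": -x,          # perfect negative linear relationship
    "x_squared": x ** 2,         # fully determined by x, but not linearly
})

corr = df.corr()  # Pearson correlation matrix
print(corr["x"].round(2))
```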

Health Metrics by Heart Disease Status

This section uses boxplots to compare the distribution of key health metrics between participants with and without heart disease. Boxplots provide several key insights:

  • The box shows the interquartile range (IQR) containing the middle 50% of values
  • The line inside the box represents the median
  • The whiskers extend to show the rest of the distribution
  • Points beyond the whiskers represent potential outliers
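
The boxplot elements listed above can be computed directly. The sketch below assumes the common 1.5 × IQR whisker rule; the plotting library used for this dashboard may be configured differently, and the values are made up.

```python
import numpy as np

# Illustrative values with one unusually large observation.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

# Box: first quartile, median, third quartile.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Assumed whisker rule: reach to the furthest point within 1.5 * IQR of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points beyond the fences are drawn individually as potential outliers.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```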

Age, Blood Pressure, Cholesterol, BMI, and Sleep Hours

These boxplots show how age, blood pressure, cholesterol levels, BMI, and sleep hours differ between those with and without heart disease.

Triglycerides, Blood Sugar, CRP, and Homocysteine

These additional boxplots examine differences in triglyceride levels, fasting blood sugar, CRP (C-reactive protein) levels, and homocysteine levels between the two groups.

Statistical Analysis

This section performs statistical tests to determine if the observed differences between groups are statistically significant.

BMI Analysis

First, we check if BMI is normally distributed using a QQ plot, and then conduct a t-test to compare mean BMI between participants with and without heart disease.

Note

The QQ plot shows that BMI follows a roughly normal distribution, with some deviation in the tails. This suggests a t-test is appropriate for comparing means.


T-test results between Heart Disease Status for BMI:
t-statistic: -0.347
p-value: 0.728
There is no significant difference in BMI between heart disease status.
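
A two-sample comparison of this kind could be sketched with SciPy's ttest_ind as below. The helper name compare_means, the 0.05 threshold, and the choice of Welch's variant (equal_var=False) are assumptions for illustration; the report does not state which variant or threshold was used.

```python
import numpy as np
from scipy import stats


def compare_means(group_a, group_b, alpha=0.05):
    """Two-sample t-test; returns (t-statistic, p-value, significant?)."""
    # equal_var=False selects Welch's t-test, which does not assume
    # equal variances; nan_policy="omit" skips missing values.
    result = stats.ttest_ind(group_a, group_b, equal_var=False, nan_policy="omit")
    return result.statistic, result.pvalue, bool(result.pvalue < alpha)


# Illustrative data: two groups with identical values, then a shifted copy.
same_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
same_b = same_a.copy()
shifted = same_a + 100.0

t_same, p_same, sig_same = compare_means(same_a, same_b)
t_shift, p_shift, sig_shift = compare_means(same_a, shifted)

print(f"t-statistic: {t_same:.3f}, p-value: {p_same:.3f}, significant: {sig_same}")
print(f"t-statistic: {t_shift:.3f}, p-value: {p_shift:.3g}, significant: {sig_shift}")
```

A p-value above the chosen threshold, as in the BMI result above, means the data do not provide evidence of a difference in means; it does not prove the means are equal.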

Blood Pressure Analysis

Similarly, we check the normality of blood pressure data and conduct a t-test to compare means between groups.


T-test results between Heart Disease Status for Blood Pressure:
t-statistic: 0.172
p-value: 0.864
There is no significant difference in Blood Pressure between heart disease status.

Cholesterol Level Analysis

Finally, we analyze cholesterol levels using the same approach.


T-test results between Heart Disease Status for Cholesterol Level:
t-statistic: -0.914
p-value: 0.361
There is no significant difference in Cholesterol Level between heart disease status.

Interactive Risk Factor Analysis

This section provides interactive visualizations to explore relationships between age, various health metrics, and heart disease status across different lifestyle groups.

Smoking and Heart Disease

These interactive plots allow exploration of the relationship between age, various health metrics, and heart disease status specifically for smokers and non-smokers.

Tip

How to use: Select different health metrics from the dropdown menu to see how they relate to age among smokers. Use the Heart Disease Status filter to compare those with and without heart disease.

Combined Risk Factors

These visualizations examine the combined effects of multiple risk factors, specifically smoking combined with diabetes or high alcohol consumption.

Low-Risk vs High-Risk Group Analysis

This visualization examines participants with minimal risk factors: non-smokers who do not consume alcohol. The high-risk group is smokers who consume alcohol with a family history of heart disease.
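
The two groups described above can be expressed as boolean filters. The sketch below is only an illustration: the column names ("Smoking", "Alcohol Consumption", "Family Heart Disease") and their labels are assumptions and may not match the dataset's actual encoding.

```python
import pandas as pd

# Illustrative data; column names and category labels are assumptions.
df = pd.DataFrame({
    "Smoking": ["No", "Yes", "Yes", "No", "Yes"],
    "Alcohol Consumption": ["None", "High", "Low", "None", "None"],
    "Family Heart Disease": ["No", "Yes", "Yes", "Yes", "Yes"],
    "Sleep Hours": [7.5, 6.0, 8.0, 6.5, 7.0],
})

smokes = df["Smoking"] == "Yes"
drinks = df["Alcohol Consumption"] != "None"
family_history = df["Family Heart Disease"] == "Yes"

# Low risk: non-smokers who do not consume alcohol.
low_risk = df[~smokes & ~drinks]
# High risk: smokers who consume alcohol and have a family history.
high_risk = df[smokes & drinks & family_history]

print(low_risk["Sleep Hours"].mean(), high_risk["Sleep Hours"].mean())
```

Participants who match neither definition (for example, smokers who do not drink) fall outside both groups and are excluded from the comparison.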

Tip

After exploring the plots, you may notice that the high-risk group appears to have a higher average number of sleep hours. We test this with a t-test to see whether the difference is statistically significant.


T-test results between High-Risk and Low-Risk Groups for Sleep Hours:
t-statistic: -1.406
p-value: 0.163
There is no significant difference in Sleep Hours between High-Risk and Low-Risk Groups.

Key Findings

Based on the analyses performed, here are the key findings from this dataset:

  1. Significant Results: No statistically significant results were found in the t-tests for blood pressure, cholesterol level, or BMI. Other tests could be performed to explore the relationships between the variables, but due to the uniform nature of the data, more complex models are probably needed.

  2. Risk Factor Combinations: The interactive visualizations suggest that combinations of risk factors (smoking with high alcohol consumption) may have compounding effects on health metrics such as fasting blood sugar levels.

  3. Age Relationships: The regression lines in the interactive plots show how the relationship between age and various health metrics differs across lifestyle groups and heart disease status.

  4. Data Distribution: The variables are close to uniformly distributed, as seen in the box, scatter, and categorical plots.

Limitations and Future Work

  • The dataset is designed for machine learning and therefore has a uniform distribution of the variables to help with classification tasks and prevent overfitting. However, this makes it difficult to draw definitive conclusions about the relationships between the variables with simple statistical models.
  • Further multivariate analyses may help identify interactions between multiple risk factors.
  • Predictive modeling should be applied to determine which factors best predict heart disease status, as simple statistical models are not sufficient to support definitive conclusions.
  • Additional lifestyle and genetic factors not included in this dataset may play important roles in heart disease risk.